Fred LaPolla
4/14/2020
Students will be able to:
Let’s pull in our same NYC HANES data set.
library(RCurl)
url <-getURL("https://raw.githubusercontent.com/fredwillie/RScience2020/master/NYC_HANES_DIAB.csv")
nyc <- read.csv(text = url)
nyc <- na.omit(nyc)
nyc$AGEGROUP <- factor(nyc$AGEGROUP, levels = 1:3, labels = c("Youngest", "Middle", "Aged"))
nyc$GENDER <- factor(nyc$GENDER, levels = 1:2, labels = c("male", "female"))
# Rename the HSQ_1 factor for identification
nyc$HSQ_1 <- factor(nyc$HSQ_1, levels = 1:5, labels=c("Excellent","Very Good","Good", "Fair", "Poor"))
# Rename the DX_DBTS as a factor
nyc$DX_DBTS <- factor(nyc$DX_DBTS,levels = 1:3, labels=c("Diabetes with DX","Diabetes with no DX","No Diabetes"))There are two major reasons we might want to visualize our data: exploration and communication.
In the first class, what were some of the ways we discussed to get a sense of our data? Why do we do this?
In addition to getting an understanding of our data, these tools help us understand what sort of tests are possible on our data type.
One of the first things we may want to know is “Is my data normally distributed”? Why might we want to know this?
The command hist() let’s us make a histogram:
#For today don't worry about par, this is to plot three charts side by side
threechart <-par(mfrow = c(1,3))
#this is the meat of the matter
hist(nyc$GLUCOSE)
hist(nyc$CHOLESTEROLTOTAL)
hist(nyc$LEAD)You could also give it a fill if you want:
We can also use boxplots, sometimes also called Box and Whisker or Tukey plots.
Boxplots display: the median, quartiles and outliers (which are by default 1.5 x the interquartile range above or below the quartiles). You can tell if there is relatively a lot of skew or not because the median will be near the mean.
You can also use boxplots to look at comparisons of groups.
You can also add in a notch, which if the notch overlaps is a good rule of thumb that they are not statistically significantly different (but beware of statistical significance).
Another option for looking at skew is the Quantile-Quantile Plot, or QQ Plot. This chart plots quantiles, similar to percentiles of our data against a normally distributed quantile. Baiscally, the more our data lies on a straight line, the more normal it is. Compare:
#QQNorm plots how the quantiles of our sample compare to a theoretical normal distribution
par(mfrow=c(1,2))
##QQPlot for total cholesterol
qqnorm(nyc$CHOLESTEROLTOTAL, main = "Normal Q-Q Plot Total Cholesterol")
#QQLine adds a straight line for reference
qqline(nyc$CHOLESTEROLTOTAL)
##QQPlot for Cadmium
qqnorm(nyc$CADMIUM,main= "Normal Q-Q Plot Cadmium")
qqline(nyc$CADMIUM)
Sometimes we may be running a correlation test. While running Pearson’s or a Mann-Whitney Correlation test can give you results, it can also be nice to look at an image of that correlation to see how nicely (or not) the points seem to correlate:
##
## Wilcoxon signed rank test with continuity correction
##
## data: nyc$A1C and nyc$GLUCOSE
## V = 0, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
Sometimes, for example in cases where our scale has exponentially larger outliers, it can be helpful to try to normalize our data so that the chart is more viewable. You could try a transformation like “log()”. It is helpful to do the log +1 in this case because log(0) is negative infinity
par(mfrow = c(1,2))
plot(nyc$GLUCOSE, nyc$CHOLESTEROLTOTAL)
plot(log(nyc$GLUCOSE+1), log(nyc$CHOLESTEROLTOTAL+1))In looking at gene expression data you may want to see a heatmap to see which timepoints/genes are being expressed. A heatmap is a quick way to look at large volumes of data by highlighting cells that have higher or lower values in them.
This data provided by Dr. Dolgalev is a table of RNA-seq normalized counts. These are expression values of all genes across different samples. There are wild-type and Tet2-knockout samples (4 of each).
For more details on how the table was generated, the steps are summarized here: http://bit.ly/snsdemo
## gene pan1wt pan2wt pan3ko pan4ko pan5ko pan6ko pan7wt pan8wt
## 1 0610005C13Rik 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## 2 0610006L08Rik 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## 3 0610009B22Rik 35.856 40.314 35.508 37.582 38.570 38.426 41.635 40.510
## 4 0610009E02Rik 1.434 0.000 0.000 0.817 0.000 2.175 0.000 1.870
## 5 0610009L18Rik 2.868 2.573 1.732 3.268 0.000 0.000 0.000 1.246
## 6 0610009O20Rik 74.581 60.041 45.900 44.936 57.855 58.001 47.691 84.759
## [1] "data.frame"
We can see column 1 is the gene name, column 2, 3 and 8 and 9 are the Wild Type samples, and columns 4:7 are the knock out samples.
## gene pan1wt pan2wt pan7wt pan8wt pan3ko pan4ko pan5ko pan6ko
## 1 0610005C13Rik 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## 2 0610006L08Rik 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## 3 0610009B22Rik 35.856 40.314 41.635 40.510 35.508 37.582 38.570 38.426
## 4 0610009E02Rik 1.434 0.000 0.000 1.870 0.000 0.817 0.000 2.175
## 5 0610009L18Rik 2.868 2.573 0.000 1.246 1.732 3.268 0.000 0.000
## 6 0610009O20Rik 74.581 60.041 47.691 84.759 45.900 44.936 57.855 58.001
dolgaMat <- as.matrix(dolgalevNormRNA[,2:9], "numeric")
rownames(dolgaMat)<- dolgalevNormRNA[,1]
head(dolgaMat)## pan1wt pan2wt pan7wt pan8wt pan3ko pan4ko pan5ko pan6ko
## 0610005C13Rik 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## 0610006L08Rik 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## 0610009B22Rik 35.856 40.314 41.635 40.510 35.508 37.582 38.570 38.426
## 0610009E02Rik 1.434 0.000 0.000 1.870 0.000 0.817 0.000 2.175
## 0610009L18Rik 2.868 2.573 0.000 1.246 1.732 3.268 0.000 0.000
## 0610009O20Rik 74.581 60.041 47.691 84.759 45.900 44.936 57.855 58.001
## [1] "matrix"
## [1] "numeric"
## pan1wt pan2wt pan7wt pan8wt
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.00
## Mean : 96.99 Mean : 112.84 Mean : 98.06 Mean : 82.22
## 3rd Qu.: 2.87 3rd Qu.: 2.57 3rd Qu.: 3.03 3rd Qu.: 3.12
## Max. :38552.42 Max. :83293.70 Max. :37493.25 Max. :103143.74
## pan3ko pan4ko pan5ko pan6ko
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.00
## Mean : 117.37 Mean : 112.88 Mean : 102.06 Mean : 97.49
## 3rd Qu.: 1.73 3rd Qu.: 2.45 3rd Qu.: 3.21 3rd Qu.: 2.17
## Max. :76915.11 Max. :68016.87 Max. :54551.21 Max. :41122.76
knockOut <- dolgaMat[,5:8]
wildType <- dolgaMat[,1:4]
exp_genesAll <- names(which(rowSums(dolgaMat)>15000))
exp_genesKO = names(which(rowSums(knockOut)>15000))
exp_genesWT = names(which(rowSums(wildType)>15000))
dolgaExp <- dolgaMat[exp_genesAll,]
expressedKO <- knockOut[exp_genesKO,]
expressedWT <- wildType[exp_genesWT,]## Loading required package: grid
## ========================================
## ComplexHeatmap version 2.0.0
## Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
## Github page: https://github.com/jokergoo/ComplexHeatmap
## Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
##
## If you use it in published research, please cite:
## Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional
## genomic data. Bioinformatics 2016.
## ========================================
We might look at a heatmap of the highly expressed genes
Then we can take this figure, and using a command called hclust, we can pass a method we want the clustering to be done. dist means it is creating a “distance matrix” of the correlations of gene expression, which is a way of saying if we were to chart the values, how far would each be from one another. You can explor ways you can cluster by viewing ?dist. Then this
##The following is borrowed from Dr. Itai Yannai
##A correlation of the highly expressed genes in our set
C = cor(t(dolgaExp))
#then it becomes a heatmap
h <- Heatmap(C)
print(h)## We cluster the values in this set of gene expression using euclidian distance
## Basically on the backend, R is making a matrix of the values in C and then calculating
##How far they are from one another
h <- hclust(d=dist(C))
## Then we are choosing where to cut the dendrogram, here 6 "branches" or groups have
## been cut, you could also choose to cut the dendrogram by height using h instead of K
hc <- as.factor(cutree(h, k=6))
## Now we make a heatmap again using the order of the clustering
hh = ComplexHeatmap::Heatmap(C[h$order, h$order],cluster_rows=FALSE, cluster_columns=FALSE)
## Finally we add annotations to show the groups that we have made
an = HeatmapAnnotation(df = data.frame(hc[h$order]), which = 'row')
print(hh+an)We can see that our data is highly skewed so we may want to perform this on a log scale:
Maybe we might want to see something like only cases of genes that are highly correlated between groups:
dolgNew <- vector("numeric", )
corLoop<- for(i in 1:506){
correlation<-ifelse( cor( dolgaExp[i,1:4], dolgaExp[i,5:8]) > .70,
cor(dolgaExp[i,1:4],dolgaExp[i,5:8]),
9999)
dolgNew[i]<- correlation
}What are we doing here: first creating that empty vector of the length of the normalized counts. Next we are setting up a for loop to run over this length. Then we are setting up an if else to give us one of two values: if the correlation of the the two mousetypes gene expressions are over .7 then save that value, otherwise assign a 9999. We now have three types of values: NAs, correlations for “highly correlated” values or 9999s.
Now let’s make a matrix of getting rid of those with a low correlation. We are saying create a matrix. That matrix should pull from the dolgaMat matrix, but only rows that are not (!=) 9999, which was our non-correlated value above. Let’s also get rid of the NAs
Maybe we want to leave the order as is:
But we can also use several methods of hierarchical clustering in our chart. To learn more about these try: ?hclust
##
## The downloaded binary packages are in
## /var/folders/wh/kxxcb38x3970_qdg14gscpd1k2bqnj/T//Rtmpgfg5Vx/downloaded_packages
You can also use a command kmeans() for unsupervised clustering:
While base plots like those above are fine for data exploration, often we will also need to create figures for a publication.
A good option is a package called GGPlot2, which stands for Grammar of Graphics (you will never need to know that)
It is part of the tidyverse, so you should actually have it but just in case:
##
## The downloaded binary packages are in
## /var/folders/wh/kxxcb38x3970_qdg14gscpd1k2bqnj/T//Rtmpgfg5Vx/downloaded_packages
GGPlot is very common, so you should learn it, but it is also kind of weird relative to a lot of ways you might normally think about charts. GGPlot works by assigning values to aesthetic components of a chart (think x, y coordinates, colors, shapes etc), and then adds layers of shapes and labels to actually visualize these things.
This typically requires working in multiple steps.
first we name the dataset and assign our basic X Y coordinates. Not all charts will have both. Let’s start with a basic scatterplot
When we run this, nothing happens, or seems to happen.
We next need to add a layer that tells R what sort of shape to map those aesthetic attributes onto:
We could try a different “geom” but most do not actually make a lot of sense for this combination of variables:
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
There is **no need* to memorize these “geoms.” Instead go to the Help up on the top menu and open the Cheatsheets. There are many cheatsheets, and they add when you install packages. This is a quick reference guide.
So now if we want to make charts like those above we can:
Histograms
We can also set the binwidth
## Warning: `geom_bar()` no longer has a `binwidth` parameter. Please use
## `geom_histogram()` instead.
## Warning: `geom_bar()` no longer has a `binwidth` parameter. Please use
## `geom_histogram()` instead.
Or
In R Error bars are also added als a layer. In this code we are first making a matrix of mean cholesterol levels by gender
## Getting mean cholesterol and setting it as a matrix
genderChol <- as.matrix(by(nyc$CHOLESTEROLTOTAL, nyc$GENDER, mean))
## we actually want a data frame and are setting it here. The output of "by" cannot directly become a dataframe.
genderChol <- as.data.frame(genderChol)
## Naming the column of Mean Cholesterol
names(genderChol)<- "MeanCholesterol"
## creating a vector of gender labels
sex <- c("male","female")
## Combining the two, previously m and f were row names, not actual data
genderChol <- cbind(genderChol, sex)
## Getting the standard deviation of cholesterol by gender and adding it to our dataframe
standdev <- as.vector(by(nyc$CHOLESTEROLTOTAL,nyc$GENDER, sd))
genderChol <- cbind(genderChol, standdev)
##Greating a standard error vector
se <- vector("numeric", 2)
se[1] <- (genderChol$standdev[1])/sqrt(length(which(nyc$GENDER=="male")))
se[2]<- (genderChol$standdev[2])/sqrt(length(which(nyc$GENDER=="female")))
genderChol <- cbind(genderChol, se)
## creating an upper and lower conf. interval
genderChol[,5]<- genderChol$MeanCholesterol+genderChol$se
genderChol[,6]<- genderChol$MeanCholesterol-genderChol$se
ggplot(genderChol, aes(x = sex, y = MeanCholesterol) )+geom_bar(stat="identity", fill = "steelblue") + geom_errorbar(y = genderChol$MeanCholesterol,
ymin = genderChol$V6,
ymax = genderChol$V5,
stat= "identity", inherit.aes = TRUE, color = "orange",
alpha = .9, size = 1, linetype = 1, width = .2)
So far we have just found a relatively complex way to make the same plots that hist() and boxplot() made.
One of the first things we might try is formatting by color. Color could serve two purposes: meaning can be encoded in color or for design. By encoding meaning, I mean that the color tells us information, for example in the heatmap.
A quirk of ggplot is that colors of points are encoded col, but if you want bars to be filled, you must use fill:
## Warning: `geom_bar()` no longer has a `binwidth` parameter. Please use
## `geom_histogram()` instead.
You can also use color blind friendly palettes, like those offered by the package colorbrewer2
##
## The downloaded binary packages are in
## /var/folders/wh/kxxcb38x3970_qdg14gscpd1k2bqnj/T//Rtmpgfg5Vx/downloaded_packages
Then:
We may also want to change up the scales. The plot we just made aonly had tick marks every 100, but that is too large a scale for cholesterol. the seq() command takes in a starting value, a finishing value and a number to count by.
We could do the same for the y scale:
We can also relabel the labels to make it more presentable for a paper, poster or talk. Perhaps you are noticing a trend that at first seems strange in ggplot but makes life easier: we keep making changes by adding arguments on to the end. This means once you get your baseline plot drawn you can then experiment on customizing for publication.
We can also customize the overall “look” of the chart with a theme. As you start adding “theme_” R will suggest options, try them out to see which you like. The second theme argument removes a redundant label on the legend.